For more background in statistics check out the following books:
All presented material is under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
# Clear the workspace
rm(list = ls())
# List of package needed for this workshop
reqpkg <- c("ggplot2","jpeg", "Rtsne", "randomForest", "MASS")
# Check if the packages are installed:
inpkg <- installed.packages()[, "Package"] # installed packages
neededpkg <- reqpkg[!reqpkg %in% inpkg]
if(length(neededpkg) > 0){
stop(paste("\n Need to install the following packages:", paste(neededpkg, collapse = ", ")))
}
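Instead of stopping, you could install the missing packages automatically. A minimal sketch, assuming a CRAN mirror is configured and an internet connection is available:

```r
reqpkg <- c("ggplot2", "jpeg", "Rtsne", "randomForest", "MASS")
# Install any packages that are not yet present
neededpkg <- reqpkg[!reqpkg %in% installed.packages()[, "Package"]]
if (length(neededpkg) > 0) {
  install.packages(neededpkg)
}
# Load all required packages quietly
invisible(lapply(reqpkg, library, character.only = TRUE))
```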
If you haven’t done that already, download the workshop materials:
url_to_class_materials <- "https://stanford.box.com/s/sqmpjdbuxa29a8fenm1cqxc0ri6c3276"
# Change the path directory and uncomment the line below
#setwd("/path/to/dir/")
download.file(url_to_class_materials, "R_workshop.zip")
unzip("R_workshop.zip")
Why do you need a model? You would like to:
In this workshop you won’t learn the mathematical details of how the methods presented work, or how to create a good, statistically sound model. I will just equip you with the knowledge of how to use some of the R packages that implement the more popular statistical techniques, and how to interpret their outputs.
You can check out other workshops, e.g. on statistical learning or machine learning, to learn more about the theory these methods are based on.
To illustrate how statistical modelling works in R we will use the built-in mtcars dataset, which comes from a 1974 issue of Motor Trend magazine.
data("mtcars")
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Each row of mtcars represents one model of car, indicated in the row names. One often-asked question in data analysis is whether there is a significant difference between two samples, or groups. For example, one might be interested in whether a group of patients subject to a medical treatment had better outcomes than a control group. These kinds of hypotheses can be tested with statistical procedures.
Here we will show how to perform a Student’s T-test, which is one of the methods statisticians traditionally use to test the equality of the means of two populations.
Analyzing our example dataset, we might be interested in whether the fuel efficiency, i.e. mileage per gallon, is different for cars with automatic and manual transmission. The example and the code in the following testing and linear regression sections are adapted from the one found here.
First we can visualize the observations using a boxplot. The data about the transmission is stored in the am column and about the fuel efficiency in the mpg column.
data("mtcars")
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
labels = c("automatic", "manual"))
ggplot(mtcars, aes(x = am, y = mpg)) + geom_boxplot() +
xlab("Trasmission") + ylab("Fuel efficiency")
To test whether the mean mileage per gallon is different between the cars with automatic and manual transmission, we can use the t.test() function:
(tt <- t.test(mpg ~ am, data=mtcars))
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group automatic mean in group manual
17.14737 24.39231
Here the first argument is the formula mpg ~ am. In R the tilde symbol, ~, means “explained by”. That is, we want to test whether miles per gallon is explained by the transmission type. The second argument is the dataset we use, mtcars.
We can extract the information computed for the t-test, which is stored in the following elements of tt:
names(tt)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "alternative" "method" "data.name"
We can extract the p-value: the probability of observing a difference in the sample means at least as extreme as the one seen, given that both samples come from the same population. In our case this means that if automatic- and manual-transmission cars indeed have equal average mileage per gallon, then the probability of observing a difference in the sample mean mpg greater than or equal to the one observed is equal to the p-value.
tt$p.value
[1] 0.001373638
The 95% confidence interval for the difference between the two means is stored in:
tt$conf.int
[1] -11.280194 -3.209684
attr(,"conf.level")
[1] 0.95
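We can sanity-check these numbers by computing the group means directly; the observed difference (automatic − manual) should fall inside the reported confidence interval:

```r
data(mtcars)
# Group means of mpg by transmission (0 = automatic, 1 = manual)
means <- tapply(mtcars$mpg, mtcars$am, mean)
means
# Observed difference, automatic - manual (about -7.24)
d <- unname(means[1] - means[2])
tt <- t.test(mpg ~ am, data = mtcars)
tt$conf.int[1] < d & d < tt$conf.int[2]   # TRUE
```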
Supervised learning is the task of inferring the relationship between input data and a response variable.
In supervised learning each observation consists of a pair \((x, y)\), where \(x\) is the input data, e.g. a collection of attributes, and \(y\) is the label or the output value.
Supervised learning problems can be divided into classification and regression tasks.
Regression: \(y\) is a quantitative variable (numerical/continuous/ordered) e.g. mileage per gallon in the mtcars dataset
Classification: \(y\) is a qualitative variable (categorical)
e.g. transmission in the mtcars dataset
Linear regression is a type of regression where the quantitative output variable is modeled as a linear function of the inputs.
Simple linear regression predicts the output \(y\) from a single predictor \(x\). That is, we assume the outcome \(y\) follows the linear model below, where \(\epsilon\) is a random noise term with zero mean.
\[y = \beta_0 + \beta_1 x + \epsilon\]
Multiple linear regression extends this to \(p\) predictors:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon = \vec \beta \cdot \vec x + \epsilon\]
Linear regression seeks a solution \(\hat y = \hat \beta \cdot \vec x\) such that the sum of squared differences between the true outcomes \(y\) and the predictions \(\hat y\) (the sum of squared residuals) is minimized:
\[\arg \min\limits_{\hat \beta} \sum_i \left(y^{(i)} - \hat \beta \cdot \vec x^{(i)}\right)^2\]
Analyzing the example dataset mtcars, we can try to model the mileage per gallon using the weight of the car. That is we would like to use the weight of the car to predict the mpg.
In R, linear models can be fit using the lm() function.
fit <- lm(mpg ~ wt, mtcars)
Note that we use the same notation as for the t.test function. We can check the details on the fitted model by calling:
summary(fit)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The results are organized as follows:
* Call – the function call for the fitted model,
* Residuals – the residuals between the true mpg and the predicted values,
* Coefficients – the fitted coefficients of the model (the \(\beta\)’s in the linear model formula). The values, standard errors, t-statistics and p-values are shown. The stars in the last column mark the significance of each coefficient (more stars means more significance),
* other details on the performance of the model.
Coefficients of the model can be extracted with:
(co <- coef(summary(fit)))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285126 1.877627 19.857575 8.241799e-19
wt -5.344472 0.559101 -9.559044 1.293959e-10
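For simple linear regression the least-squares minimization has a closed form, \(\hat\beta_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\), which we can check against the values above:

```r
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg
# Closed-form least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept
c(b0, b1)   # matches the Estimate column above: 37.285, -5.344
```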
To predict the mpg for the existing observations (cars) you can call
predict(fit)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
23.282611 21.919770 24.885952 20.102650 18.900144 18.793255 18.205363
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
20.236262 20.450041 18.900144 18.900144 15.533127 17.350247 17.083024
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
9.226650 8.296712 8.718926 25.527289 28.653805 27.478021 24.111004
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
18.472586 18.926866 16.762355 16.735633 26.943574 25.847957 29.198941
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
20.343151 22.480940 18.205363 22.427495
To predict the mpg for new observations, e.g. a car with a specific weight such as wt = 3.14, we can use the computed coefficients:
co[, 1]
(Intercept) wt
37.285126 -5.344472
# beta0 + beta1 * wt
co[1, 1] + co[2, 1]* 3.14 # 37.285126 + (-5.344472) * 3.14
[1] 20.50349
We can make predictions for a large number of new observations by creating a data frame with new weights:
newcars <- data.frame(wt = c(2, 2.1, 3.14, 4.1, 4.3))
predict(fit, newcars)
1 2 3 4 5
26.59618 26.06174 20.50349 15.37279 14.30390
Note that this produces the same prediction for wt = 3.14 as we computed previously.
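Beyond point predictions, predict() for lm fits can also return interval estimates via its interval argument (see ?predict.lm). A short sketch:

```r
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
newcars <- data.frame(wt = c(2, 3.14, 4.3))
# 95% confidence interval for the mean mpg at each weight
predict(fit, newcars, interval = "confidence")   # columns: fit, lwr, upr
# interval = "prediction" gives the (wider) interval for an individual new car
```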
We can plot the data and the fitted model using ggplot2. This can be done even without computing the model in the previous step:
geom_smooth() with the method argument set to "lm" does the computation automatically.
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm")
More elaborate models involving more input variables can be fitted using the same function lm().
For example, we might be interested in predicting the mpg from the weight, displacement and number of cylinders of the car.
Look at the data:
ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + geom_point()
mfit1 <- lm(mpg ~ wt + disp + cyl, data = mtcars)
summary(mfit1)
Call:
lm(formula = mpg ~ wt + disp + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.4035 -1.4028 -0.4955 1.3387 6.0722
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.107678 2.842426 14.462 1.62e-14 ***
wt -3.635677 1.040138 -3.495 0.00160 **
disp 0.007473 0.011845 0.631 0.53322
cyl -1.784944 0.607110 -2.940 0.00651 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.595 on 28 degrees of freedom
Multiple R-squared: 0.8326, Adjusted R-squared: 0.8147
F-statistic: 46.42 on 3 and 28 DF, p-value: 5.399e-11
The same commands as before can be applied to summary(mfit1) to extract the information.
To predict mpg for new cars, first create a data frame of new inputs per car:
newcars <- data.frame(wt = c(2, 2.1, 3.14, 4.1, 4.3),
disp = c(100, 200, 500,300, 210),
cyl = c(6,6,4,6,8))
predict(mfit1, newcars)
1 2 3 4 5
23.87395 24.25768 26.28834 17.73362 12.76403
Models with interaction effects can be specified with *:
mfit2 <- lm(mpg ~ am * wt, mtcars)
summary(mfit2)
Call:
lm(formula = mpg ~ am * wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.6004 -1.5446 -0.5325 0.9012 6.0909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.4161 3.0201 10.402 4.00e-11 ***
ammanual 14.8784 4.2640 3.489 0.00162 **
wt -3.7859 0.7856 -4.819 4.55e-05 ***
ammanual:wt -5.2984 1.4447 -3.667 0.00102 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
Nowadays, working with real datasets, we can encounter the problem of having “too many” variables. One example is predicting the risk of a disease from single nucleotide polymorphism (SNP) data, where the number of predictors can be much larger than the number of observations. Lasso regression is especially useful for problems where the number of available covariates is extremely large, but only a handful of them are relevant for predicting the outcome.
Lasso is simply linear regression with an \(L_1\) penalty. That is, we are interested in solving the following problem:
\[\arg \min\limits_{\hat \beta} \sum_i \left(y^{(i)} - \hat \beta \cdot \vec x^{(i)}\right)^2 + \lambda \|\hat \beta\|_1\]
It turns out that the \(L_1\) norm \(\|\vec x\|_1 = \sum_j |x_j|\) promotes sparsity which means that the solutions found will usually have only a small number of non-zero coefficients. The number of non-zero coefficients depends on the choice of the tuning parameter, \(\lambda\). The higher the \(\lambda\) the fewer non-zero coefficients.
Lasso regression is implemented in an R package glmnet. An introductory tutorial to the package can be found here.
We again use the mtcars dataset for the following example. Now we will build a model predicting mpg using Lasso regression. We supply all the available covariates and let the algorithm figure out which ones to keep.
library(glmnet)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 manual 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 manual 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 manual 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 automatic 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 automatic 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 automatic 3 1
y <- mtcars[, 1] # mileage per gallon
x <- mtcars[, -1] # all other variables treated as predictors
x <- data.matrix(x) # convert the data frame to a numeric matrix
set.seed(123)
trainIdx <- sample(1:nrow(mtcars), round(0.7 * nrow(mtcars)))
fit <- glmnet(x[trainIdx, ], y[trainIdx])
print(fit)
Call: glmnet(x = x[trainIdx, ], y = y[trainIdx])
Df %Dev Lambda
[1,] 0 0.0000 4.679000
[2,] 1 0.1383 4.264000
[3,] 2 0.2626 3.885000
[4,] 2 0.3700 3.540000
[5,] 2 0.4593 3.225000
[6,] 2 0.5333 2.939000
[7,] 2 0.5948 2.678000
[8,] 2 0.6459 2.440000
[9,] 2 0.6883 2.223000
[10,] 2 0.7235 2.026000
[11,] 2 0.7527 1.846000
[12,] 2 0.7770 1.682000
[13,] 3 0.7993 1.532000
[14,] 3 0.8179 1.396000
[15,] 3 0.8335 1.272000
[16,] 3 0.8463 1.159000
[17,] 3 0.8570 1.056000
[18,] 3 0.8659 0.962300
[19,] 3 0.8733 0.876800
[20,] 4 0.8797 0.798900
[21,] 4 0.8862 0.727900
[22,] 4 0.8915 0.663300
[23,] 4 0.8960 0.604300
[24,] 4 0.8997 0.550700
[25,] 4 0.9028 0.501700
[26,] 4 0.9054 0.457200
[27,] 4 0.9075 0.416600
[28,] 4 0.9093 0.379500
[29,] 5 0.9108 0.345800
[30,] 6 0.9124 0.315100
[31,] 5 0.9139 0.287100
[32,] 5 0.9152 0.261600
[33,] 5 0.9162 0.238400
[34,] 5 0.9171 0.217200
[35,] 5 0.9178 0.197900
[36,] 5 0.9184 0.180300
[37,] 5 0.9189 0.164300
[38,] 5 0.9193 0.149700
[39,] 4 0.9197 0.136400
[40,] 4 0.9199 0.124300
[41,] 4 0.9201 0.113200
[42,] 4 0.9203 0.103200
[43,] 5 0.9215 0.094020
[44,] 7 0.9263 0.085660
[45,] 7 0.9313 0.078050
[46,] 6 0.9350 0.071120
[47,] 6 0.9361 0.064800
[48,] 6 0.9371 0.059050
[49,] 7 0.9379 0.053800
[50,] 7 0.9387 0.049020
[51,] 8 0.9396 0.044670
[52,] 9 0.9414 0.040700
[53,] 10 0.9443 0.037080
[54,] 10 0.9473 0.033790
[55,] 10 0.9499 0.030790
[56,] 10 0.9520 0.028050
[57,] 10 0.9538 0.025560
[58,] 10 0.9553 0.023290
[59,] 10 0.9565 0.021220
[60,] 10 0.9575 0.019330
[61,] 10 0.9584 0.017620
[62,] 10 0.9591 0.016050
[63,] 10 0.9597 0.014630
[64,] 10 0.9602 0.013330
[65,] 10 0.9606 0.012140
[66,] 10 0.9609 0.011060
[67,] 10 0.9612 0.010080
[68,] 10 0.9614 0.009186
[69,] 10 0.9616 0.008369
[70,] 10 0.9618 0.007626
[71,] 10 0.9619 0.006949
[72,] 10 0.9620 0.006331
[73,] 10 0.9621 0.005769
[74,] 10 0.9622 0.005256
[75,] 10 0.9623 0.004789
[76,] 10 0.9623 0.004364
[77,] 10 0.9624 0.003976
[78,] 10 0.9624 0.003623
[79,] 10 0.9625 0.003301
[80,] 10 0.9625 0.003008
[81,] 10 0.9625 0.002741
[82,] 10 0.9625 0.002497
[83,] 10 0.9626 0.002275
[84,] 10 0.9626 0.002073
[85,] 10 0.9626 0.001889
[86,] 10 0.9626 0.001721
[87,] 10 0.9626 0.001568
glmnet() computes the Lasso regression for a sequence of different tuning parameters. In the table above, the column Df denotes the number of non-zero coefficients (degrees of freedom), %Dev is the percentage of variance explained, and Lambda is the value of the corresponding tuning parameter.
plot(fit, label = TRUE)
# label = TRUE annotates the curves with the corresponding coefficient labels.
The computed Lasso coefficients for a particular choice of \(\lambda\) can be printed using:
coef(fit, s = 1)
11 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 34.877093111
cyl -0.867649618
disp .
hp -0.005778702
drat .
wt -2.757808266
qsec .
vs .
am .
gear .
carb .
Similarly to the case of lm(), we can use the function predict() to predict the outcome from existing or new input data. However, we have to specify the choice of the tuning parameter by setting the value of the argument s.
predict(fit, newx = x[-trainIdx, ], s = c(0.5, 1.5, 2))
1 2 3
Datsun 710 25.36098 23.87240 23.22262
Valiant 19.82245 19.42427 19.41920
Duster 360 16.19324 17.27111 17.74858
Merc 230 22.62471 21.86937 21.50396
Merc 450SE 15.20595 16.16123 16.71324
Cadillac Fleetwood 11.25687 13.28117 14.26985
Chrysler Imperial 10.81730 13.01570 14.07314
Fiat 128 25.88928 24.20103 23.47110
Toyota Corolla 27.01880 25.08206 24.22690
Toyota Corona 24.89106 23.51713 22.92237
Each of the columns corresponds to one choice of \(\lambda\).
The choice of \(\lambda\) can be made using cross-validation. We can use the function cv.glmnet() to perform a k-fold cross-validation.
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
set.seed(1)
cvfit <- cv.glmnet(x[trainIdx, ], y[trainIdx], nfolds = 5)
The nfolds argument sets the number of folds (k).
plot(cvfit)
The above plot shows the average MSE (red dots) over the folds, together with the standard deviation bounds (error bars) for the \(\lambda\) sequence. The two chosen \(\lambda\) values are indicated with dotted black lines.
The first one:
cvfit$lambda.min
[1] 0.2171905
is the value of \(\lambda\) that gives a minimum mean cross-validated error, and the second:
cvfit$lambda.1se
[1] 0.6632685
gives the biggest \(\lambda\) such that the MSE is within one standard error of the minimum error.
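Once cross-validation has chosen \(\lambda\), coefficients and predictions can be requested at either value directly by passing s = "lambda.min" or s = "lambda.1se". A sketch (refitting here so the block stands alone):

```r
library(glmnet)
data(mtcars)
y <- mtcars$mpg
x <- data.matrix(mtcars[, -1])
set.seed(123)
trainIdx <- sample(1:nrow(mtcars), round(0.7 * nrow(mtcars)))
cvfit <- cv.glmnet(x[trainIdx, ], y[trainIdx], nfolds = 5)
# Coefficients at the two cross-validated choices of lambda
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")
# Predictions on the held-out cars at lambda.min
predict(cvfit, newx = x[-trainIdx, ], s = "lambda.min")
```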
In this exercise you will perform Lasso regression yourself. We will use the Boston dataset from the MASS package. The dataset contains information on the Boston suburbs housing market collected by David Harrison in 1978.
We will try to predict the median value of homes in each region based on its attributes recorded in the other variables.
library(MASS)
?Boston
head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
dim(Boston)
[1] 506 14
set.seed(123)
trainIdx <- sample(1:nrow(Boston), round(0.7 * nrow(Boston)))
boston.test <- Boston[-trainIdx,"medv"]
Perform a Lasso regression with glmnet. Steps:
Extract the response (medv) and the predictor variables from the Boston data.frame and convert them, if necessary, to the correct format.
# (... ?)
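One possible sketch of a solution (not the only one); it mirrors the steps from the mtcars example above:

```r
library(glmnet)
library(MASS)
data(Boston)
set.seed(123)
trainIdx <- sample(1:nrow(Boston), round(0.7 * nrow(Boston)))
# Response: medv; predictors: all remaining columns, as a numeric matrix
y <- Boston$medv
x <- data.matrix(Boston[, setdiff(names(Boston), "medv")])
cvfit <- cv.glmnet(x[trainIdx, ], y[trainIdx], nfolds = 5)
# Test-set mean squared error at lambda.min
pred <- predict(cvfit, newx = x[-trainIdx, ], s = "lambda.min")
mean((pred - y[-trainIdx])^2)
```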
As mentioned before, unlike regression, classification deals with predicting outcomes or response variables that are qualitative, or categorical.
The task is to classify or assign each observation to a category or a class.
Despite its name, logistic regression is used for classification problems.
It performs binary classification (only two categories).
The name “regression” is due to the fact that we are still fitting a function to the data. However, in this case we are interested in the relationship between the input and the probability that the output is in one of the two classes, not in the class assignment itself. The function used to model this relationship is the logistic function:
\[P[y = 1 \mid x] = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}\]
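The logistic function maps any real-valued linear score to a probability in (0, 1); a minimal sketch:

```r
# Logistic function: maps the linear predictor to a probability in (0, 1)
logistic <- function(z) exp(z) / (1 + exp(z))
logistic(0)          # 0.5: a linear score of zero gives even odds
logistic(c(-5, 5))   # close to 0 and close to 1
```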
We go back to the mtcars dataset.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 manual 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 manual 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 manual 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 automatic 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 automatic 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 automatic 3 1
Using logistic regression we will try to predict whether a car had manual or automatic transmission based on its other characteristics.
In R the function glm() can be used to perform logistic regression (as well as many other generalized linear models, which you can look up by calling ?glm).
Note that in the original dataset the column am in mtcars is a numeric variable whose values are only 0 or 1, denoting automatic or manual transmission respectively. We converted it to a factor earlier:
class(mtcars$am)
[1] "factor"
table(mtcars$am)
automatic manual
19 13
If your copy of mtcars still stores am as a numeric variable, convert the column to a factor format, so that R knows the am variable is a categorical one:
mtcars$am <- factor(mtcars$am)
Then we fit the logistic regression predicting the transmission of the car from its mileage per gallon (mpg) and gross horsepower (hp) with the following function call:
fit.logit <- glm(am ~ mpg + hp, data = mtcars, family = "binomial")
Note that the formula am ~ mpg + hp means we are interested in explaining (~) the variable am with mpg and hp. You need to set family = "binomial" to indicate that you would like to perform logistic regression. More on the family parameter, which describes the error distribution and the link function, can be found by calling ?family.
The model summary can be printed with:
summary(fit.logit)
Call:
glm(formula = am ~ mpg + hp, family = "binomial", data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.41460 -0.42809 -0.07021 0.16041 1.66500
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -33.60517 15.07672 -2.229 0.0258 *
mpg 1.25961 0.56747 2.220 0.0264 *
hp 0.05504 0.02692 2.045 0.0409 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 19.233 on 29 degrees of freedom
AIC: 25.233
Number of Fisher Scoring iterations: 7
As we see, the output is similar to the one for the linear model, including the function call, information on residuals, the coefficient matrix, etc.
Predictions can again be computed using the predict() function, with the argument type = "response".
(newcars <- data.frame(mpg = mean(mtcars$mpg),
hp = c(110, 170, 155)))
mpg hp
1 20.09062 110
2 20.09062 170
3 20.09062 155
predict(fit.logit, newdata = newcars, type = "response")
1 2 3
0.09588369 0.74247121 0.55803325
The output above is the probability that a new car with the specified mpg and hp has a manual transmission (which corresponds to the value 1 in the factor mtcars$am). Classification can be done by assigning “manual” to all cars with probability > 0.5, and “automatic” to the rest.
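The thresholding step can be sketched as follows (the model and new data are refit here so the block stands alone):

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))
fit.logit <- glm(am ~ mpg + hp, data = mtcars, family = "binomial")
newcars <- data.frame(mpg = mean(mtcars$mpg), hp = c(110, 170, 155))
prob <- predict(fit.logit, newdata = newcars, type = "response")
# Assign "manual" whenever the predicted probability exceeds 0.5
ifelse(prob > 0.5, "manual", "automatic")   # "automatic" "manual" "manual"
```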
More on logistic regression in R can be found here
We will work through a simple example from ISL, computing an SVM on simulated data:
set.seed(1)
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1,10), rep(1,10))
x[y == 1,] <- x[y == 1, ] + 1
plot(x, col=(3-y))
dat <- data.frame(x = x, y=as.factor(y))
head(dat)
x.1 x.2 y
1 -0.6264538 0.91897737 -1
2 0.1836433 0.78213630 -1
3 -0.8356286 0.07456498 -1
4 1.5952808 -1.98935170 -1
5 0.3295078 0.61982575 -1
6 -0.8204684 -0.05612874 -1
library(e1071)
svmfit <- svm(y ~ ., data=dat, kernel="linear", cost=10, scale=FALSE)
plot(svmfit, dat)
svmfit$index
[1] 1 2 5 7 14 16 17
summary(svmfit)
Call:
svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 10
gamma: 0.5
Number of Support Vectors: 7
( 4 3 )
Number of Classes: 2
Levels:
-1 1
svmfit <- svm(y~., data=dat, kernel="linear", cost=0.1,scale=FALSE)
plot(svmfit, dat)
svmfit$index
[1] 1 2 3 4 5 7 9 10 12 13 14 15 16 17 18 20
To find the best choice of the tuning parameter cost, use the tune() function:
set.seed(1)
tune.out <- tune(svm,y ~ ., data=dat, kernel="linear",
ranges=list(cost=c(0.001, 0.01, 0.1, 1,5,10,100)))
summary(tune.out)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.1
- Detailed performance results:
cost error dispersion
1 1e-03 0.70 0.4216370
2 1e-02 0.70 0.4216370
3 1e-01 0.10 0.2108185
4 1e+00 0.15 0.2415229
5 5e+00 0.15 0.2415229
6 1e+01 0.15 0.2415229
7 1e+02 0.15 0.2415229
bestmod <- tune.out$best.model
summary(bestmod)
Call:
best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.1
gamma: 0.5
Number of Support Vectors: 16
( 8 8 )
Number of Classes: 2
Levels:
-1 1
We build a new test dataset from a model similar to the one used for the training data.
xtest <- matrix(rnorm(20*2), ncol=2)
ytest <- sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1, ] <- xtest[ytest == 1,] + 1
testdat <- data.frame(x=xtest, y=as.factor(ytest))
plot(xtest, col=(3-ytest))
ypred <- predict(bestmod, testdat)
table(predict = ypred, truth = testdat$y)
truth
predict -1 1
-1 11 1
1 0 8
And for the non-tuned model we have:
svmfit <- svm(y~., data=dat, kernel = "linear", cost=.01, scale = FALSE)
ypred <- predict(svmfit, testdat)
table(predict = ypred, truth = testdat$y)
truth
predict -1 1
-1 11 2
1 0 7
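A common summary of such confusion matrices is the accuracy, the fraction of points on the diagonal (correct predictions). For the two tables above:

```r
# Accuracy = sum of the diagonal (correct predictions) / total observations
acc <- function(tab) sum(diag(tab)) / sum(tab)
tuned   <- matrix(c(11, 0, 1, 8), nrow = 2)  # confusion table of the tuned model
untuned <- matrix(c(11, 0, 2, 7), nrow = 2)  # confusion table with cost = 0.01
acc(tuned)    # 0.95
acc(untuned)  # 0.9
```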
Now suppose we have non-linearly separable data:
set.seed(123)
# Generate 200 points
x <- matrix(rnorm(200*2), ncol=2)
x[1:100,] <- x[1:100,]+2
x[101:150,] <- x[101:150,]-2
y <- c(rep(1,150), rep(2,50))
dat <- data.frame(x = x, y = as.factor(y))
plot(x, col=y)
# Let a random half be a training set
train <- sample(200, 100)
We will use an SVM with a radial kernel. Note that here we can additionally specify the gamma parameter of the radial kernel:
svmfit <- svm(y~., data=dat[train,], kernel = "radial", gamma=1, cost=1)
plot(svmfit, dat[train,])
summary(svmfit)
Call:
svm(formula = y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, cost = 1)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 1
Number of Support Vectors: 35
( 19 16 )
Number of Classes: 2
Levels:
1 2
table(true = dat[-train, "y"], pred = predict(svmfit, newdata = dat[-train, ]))
pred
1 2
1 58 14
2 21 7
We can tune both gamma and cost parameters:
set.seed(17)
tune.out <- tune(svm, y~., data=dat[train,], kernel="radial",
ranges=list(cost=c(0.1,1,10,100,1000), gamma=c(0.5,1,2,3,4)))
summary(tune.out)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost gamma
10 0.5
- best performance: 0.08
- Detailed performance results:
cost gamma error dispersion
1 1e-01 0.5 0.22 0.15491933
2 1e+00 0.5 0.10 0.06666667
3 1e+01 0.5 0.08 0.09189366
4 1e+02 0.5 0.09 0.08755950
5 1e+03 0.5 0.10 0.11547005
6 1e-01 1.0 0.22 0.15491933
7 1e+00 1.0 0.10 0.06666667
8 1e+01 1.0 0.10 0.10540926
9 1e+02 1.0 0.11 0.09944289
10 1e+03 1.0 0.14 0.13498971
11 1e-01 2.0 0.22 0.15491933
12 1e+00 2.0 0.12 0.09189366
13 1e+01 2.0 0.10 0.10540926
14 1e+02 2.0 0.12 0.10327956
15 1e+03 2.0 0.16 0.13498971
16 1e-01 3.0 0.22 0.15491933
17 1e+00 3.0 0.12 0.10327956
18 1e+01 3.0 0.10 0.08164966
19 1e+02 3.0 0.14 0.10749677
20 1e+03 3.0 0.17 0.14181365
21 1e-01 4.0 0.22 0.15491933
22 1e+00 4.0 0.12 0.10327956
23 1e+01 4.0 0.10 0.09428090
24 1e+02 4.0 0.13 0.11595018
25 1e+03 4.0 0.18 0.13165612
table(true = dat[-train, "y"], pred = predict(tune.out$best.model, newdata = dat[-train, ]))
pred
1 2
1 56 16
2 21 7
plot(tune.out$best.model, dat[train,])
Averaging over a collection of decision trees makes the predictions more stable.
One of the features that makes random forests perform well is the introduction of randomness into the candidate splitting variables, which reduces the correlation between the generated trees.
What this means is that, at each split, the pool of potential splitting variables does not contain all variables, but only a random subset of them.
In R a random forest can be computed with the function randomForest() in the randomForest package. There are two important parameters for a random forest prediction:
mtry – the number of variables randomly sampled as candidates at each split.
ntree – the number of trees generated (to be averaged over).
We will use the iris dataset to show how classification can be done with randomForest(). Regression can be done using the same function calls.
iris is a famous (Fisher’s or Anderson’s) dataset that gives the measurements, in centimeters, of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.
## Classification:
data(iris)
# set seed for reproducibility since random forest is a random predictor
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
proximity=TRUE)
print(iris.rf)
Call:
randomForest(formula = Species ~ ., data = iris, importance = TRUE, proximity = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 5.33%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 46 4 0.08
virginica 0 4 46 0.08
## Look at variable importance:
round(importance(iris.rf), 2)
setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length 6.04 7.85 7.93 11.51 8.77
Sepal.Width 4.40 1.03 5.44 5.40 2.19
Petal.Length 21.76 31.33 29.64 32.94 42.54
Petal.Width 22.84 32.67 31.68 34.50 45.77
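The importance table can also be drawn; varImpPlot() is part of the randomForest package and produces dotcharts of the two importance measures:

```r
library(randomForest)
data(iris)
set.seed(71)
iris.rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
# Dotcharts of each variable's MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(iris.rf)
```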
Using random forest regression, predict the housing prices in the Boston dataset used previously for the exercise on the Lasso regression.
data(Boston)
set.seed(123)
trainIdx <- sample(1:nrow(Boston), round(0.7 * nrow(Boston)))
head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
# (...?)
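One possible sketch for this exercise (your error will vary with the seed); randomForest() switches to regression automatically when the response is numeric:

```r
library(randomForest)
library(MASS)
data(Boston)
set.seed(123)
trainIdx <- sample(1:nrow(Boston), round(0.7 * nrow(Boston)))
# With a numeric response (medv), randomForest() performs regression
boston.rf <- randomForest(medv ~ ., data = Boston[trainIdx, ])
pred <- predict(boston.rf, Boston[-trainIdx, ])
mean((pred - Boston[-trainIdx, "medv"])^2)   # test-set mean squared error
```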
An example from ESL Chapter 14:
Now we will do an example using the built-in dataset on violent crime rates in US states in 1973.
The code is adapted from ISL.
?USArrests
head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
states <- row.names(USArrests)
Mean and variance of the crime rates across all states:
apply(USArrests, 2, mean)
Murder Assault UrbanPop Rape
7.788 170.760 65.540 21.232
apply(USArrests, 2, var)
Murder Assault UrbanPop Rape
18.97047 6945.16571 209.51878 87.72916
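These variances differ by orders of magnitude (Assault’s is roughly 6945, Murder’s only 19), so without standardization the first principal component would be driven almost entirely by Assault. This is why prcomp() is called with scale = TRUE below. A quick comparison of the first loading vector with and without scaling:

```r
# Without scaling, Assault (the largest-variance variable) dominates PC1
round(prcomp(USArrests, scale = FALSE)$rotation[, 1], 3)
round(prcomp(USArrests, scale = TRUE)$rotation[, 1], 3)
```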
Compute a PCA:
In R you can use the function prcomp() to do PCA. Check the help page ?prcomp for details.
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy. The print method for these objects prints the results in a nice format and the plot method produces a scree plot.
prcomp() is preferred over princomp().
pr.out <- prcomp(USArrests, scale=TRUE)
names(pr.out)
[1] "sdev" "rotation" "center" "scale" "x"
pr.out$center
Murder Assault UrbanPop Rape
7.788 170.760 65.540 21.232
pr.out$scale
Murder Assault UrbanPop Rape
4.355510 83.337661 14.474763 9.366385
pr.out$rotation
PC1 PC2 PC3 PC4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432
dim(pr.out$x)
[1] 50 4
The output of prcomp() is a list containing the following:
- sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix)
- rotation: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors)
- x: the rotated data (the centered, and scaled if requested, data multiplied by the rotation matrix)
- center, scale: the centering and scaling used, or FALSE
biplot(pr.out, scale=0, cex = 0.8)
Each principal component loading and score vector is unique only up to a sign flip, so other software or another R package could, e.g., return this plot instead:
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0, cex = 0.8)
The first principal component places approximately equal weight on Assault, Murder, and Rape, with much less weight on UrbanPop. Hence this component roughly corresponds to a measure of overall rates of serious crimes. The second component places most of its weight on UrbanPop; hence, it roughly corresponds to the level of urbanization of the state. In the biplot the three crime variables are located close to each other, while UrbanPop is far from them, indicating that the UrbanPop variable is less correlated with the other three.
To decide how many components to keep, choose the smallest number of principal components that are required in order to explain a sizable amount of the variation in the data.
Look for a point at which the proportion of variance explained by each subsequent principal component drops off. This is often referred to as an elbow in the scree plot.
pr.out$sdev
[1] 1.5748783 0.9948694 0.5971291 0.4164494
pr.var <- pr.out$sdev^2
pr.var
[1] 2.4802416 0.9897652 0.3565632 0.1734301
pve <- pr.var/sum(pr.var)
pve
[1] 0.62006039 0.24744129 0.08914080 0.04335752
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')
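It is often useful to also plot the cumulative proportion of variance explained. A minimal sketch (recomputing pve so the block is self-contained):

```r
# Cumulative proportion of variance explained by the first k components
pr.out <- prcomp(USArrests, scale = TRUE)
pve <- pr.out$sdev^2 / sum(pr.out$sdev^2)
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = 'b')
```

From the pve values above, the first two components already explain about 87% of the variance (0.620 + 0.247).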
The MDS algorithm aims to place each object in an N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. The number of dimensions N of an MDS plot can exceed 2 and is specified a priori. Choosing N = 2 optimizes the object locations for a two-dimensional scatterplot.
There are different types of MDS methods, including classical MDS, metric MDS, and non-metric MDS. The details on the differences can be found on:
Ekman studied how people perceive colors. He collected data in which 31 subjects rated the similarity of each pair of 14 colors on a 5-point scale (0 = no similarity, 4 = identical). After averaging across subjects, the similarity ratings were divided by 4 to transform them to the unit interval. The 14 colors studied had wavelengths between 434 and 674 nm.
ekmanSim <- readRDS("./data/ekman.rds")
# convert similarities to dissimilarities
ekmanDist <- 1- ekmanSim
wavelengths <- round(seq( 434, 674, length.out = 14))
Use the built-in cmdscale() function for classical MDS. Metric iterative MDS and non-metric MDS functions are available in the smacof package; other packages are also compared here.
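Before applying cmdscale() to the Ekman data, here is a self-contained sketch on the built-in eurodist road distances between 21 European cities, which recovers a recognizable map:

```r
# Classical MDS on the built-in eurodist distance matrix
mdsFit <- cmdscale(eurodist, k = 2, eig = TRUE)
coords <- data.frame(mdsFit$points)
# Flip the second axis so north points up (MDS is unique only up to rotation/reflection)
plot(coords$X1, -coords$X2, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(coords$X1, -coords$X2, labels = rownames(coords), cex = 0.7)
```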
ekmanMDS <- cmdscale(ekmanDist, k = 2)
res <- data.frame(ekmanMDS)
res$wavelength <- factor(wavelengths, levels = wavelengths)
ggplot(res, aes(X1, X2)) + geom_point() + theme_bw() +
geom_text(aes(label = wavelength), hjust=0.5, vjust=-1, size = 3)
I mapped the wavelengths to hexadecimal colors using this website.
colorsDF <- data.frame(
wavelength = wavelengths,
hex = c("#2800ff", "#0051ff", "#00aeff", "#00fbff", "#00ff28", "#4eff00",
"#92ff00", "#ccff00", "#fff900", "#ffbe00", "#ff7b00", "#ff3000",
"#ff0000", "#ff0000"))
ggplot(res, aes(X1, X2)) +
geom_point(aes(color = wavelength), size = 2) + theme_bw() +
geom_text(aes(label = wavelength), hjust=0.5, vjust=-1, size = 3) +
scale_color_manual(values = as.character(colorsDF$hex))
MDS reproduces the recognizable 2D color wheel.
What is tSNE?
The following example shows how to calculate and plot a 2D t-SNE projection with the Barnes-Hut-SNE algorithm, using the Rtsne package for R. This example was worked out by Lukas Weber.
Source code: github link
# Skip the block above if it takes too long to load.
# The subsetted data has been provided in a smaller file
dat <- read.csv("./data/healthyHumanBoneMarrow_small.csv")
dim(dat)
[1] 1000 36
dat[1:10, 1:6]
Event. Time Cell.Length X191.DNA X193.DNA X115.CD45
1 100704 261232 36.35262 1052.8925 2184.2791 606.56268
2 9863 24602 25.07553 697.9535 1158.0222 192.41901
3 466415 1084484 34.44039 409.5371 542.7836 98.22443
4 216327 623282 43.57069 998.1034 1990.9073 482.09525
5 227911 925587 48.47278 697.8970 1274.9041 212.06926
6 328317 900941 29.00149 921.0044 1923.2419 284.07599
7 295168 732036 21.45761 563.6906 1198.7548 22.76224
8 198694 522453 52.10402 425.8115 657.4357 137.40990
9 106346 179388 22.12231 1106.9290 1953.5457 32.46744
10 413740 910973 27.23294 481.7126 1139.8833 209.80591
# select 13 protein markers to use in calculation of t-SNE projection
# CD11b, CD123, CD19, CD20, CD3, CD33, CD34, CD38, CD4, CD45, CD45RA, CD8, CD90
# (see Amir et al. 2013, Supplementary Tables 1 and 2)
colnames_proj <- colnames(dat)[c(11, 23, 10, 16, 7, 22, 14, 28, 12, 6, 8, 13, 30)]
colnames_proj # check carefully!
[1] "X144.CD11b" "X160.CD123" "X142.CD19" "X147.CD20" "X110.111.112.114.CD3" "X158.CD33"
[7] "X148.CD34" "X167.CD38" "X145.CD4" "X115.CD45" "X139.CD45RA" "X146.CD8"
[13] "X170.CD90"
# arcsinh transformation
# (see Amir et al. 2013, Online Methods, "Processing of mass cytometry data")
asinh_scale <- 5
dat <- asinh(dat / asinh_scale) # transforms all columns! including event number etc
# prepare data for Rtsne
dat <- dat[, colnames_proj] # select columns to use
dat <- dat[!duplicated(dat), ] # remove duplicated rows (Rtsne requires unique rows)
dim(dat)
[1] 999 13
library(Rtsne)
# run Rtsne (Barnes-Hut-SNE algorithm)
# without PCA step (see Amir et al. 2013, Online Methods, "viSNE analysis")
set.seed(123) # set random seed
rtsne_out <- Rtsne(as.matrix(dat), pca = FALSE, verbose = TRUE)
Read the 999 x 13 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 999
Done in 0.13 seconds (sparsity = 0.116284)!
Learning embedding...
Iteration 50: error is 69.448402 (50 iterations in 0.64 seconds)
Iteration 100: error is 60.170135 (50 iterations in 0.52 seconds)
Iteration 150: error is 59.830064 (50 iterations in 0.52 seconds)
Iteration 200: error is 59.790362 (50 iterations in 0.53 seconds)
Iteration 250: error is 2.494854 (50 iterations in 0.50 seconds)
Iteration 300: error is 1.072682 (50 iterations in 0.48 seconds)
Iteration 350: error is 0.932446 (50 iterations in 0.48 seconds)
Iteration 400: error is 0.893893 (50 iterations in 0.53 seconds)
Iteration 450: error is 0.878459 (50 iterations in 0.51 seconds)
Iteration 500: error is 0.865561 (50 iterations in 0.49 seconds)
Iteration 550: error is 0.856887 (50 iterations in 0.50 seconds)
Iteration 600: error is 0.852200 (50 iterations in 0.50 seconds)
Iteration 650: error is 0.847460 (50 iterations in 0.49 seconds)
Iteration 700: error is 0.844339 (50 iterations in 0.50 seconds)
Iteration 750: error is 0.841766 (50 iterations in 0.49 seconds)
Iteration 800: error is 0.838928 (50 iterations in 0.52 seconds)
Iteration 850: error is 0.837514 (50 iterations in 0.49 seconds)
Iteration 900: error is 0.835641 (50 iterations in 0.48 seconds)
Iteration 950: error is 0.835091 (50 iterations in 0.54 seconds)
Iteration 999: error is 0.834370 (50 iterations in 0.51 seconds)
Fitting performed in 10.21 seconds.
# plot 2D t-SNE projection
plot(rtsne_out$Y, asp = 1, pch = 20, col = "blue",
cex = 0.75, cex.axis = 1.25, cex.lab = 1.25, cex.main = 1.5,
xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2",
main = "2D t-SNE projection")
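t-SNE output depends strongly on the random seed and on the perplexity parameter. A self-contained sketch on the built-in iris data (the perplexity value here is chosen purely for illustration):

```r
library(Rtsne)

set.seed(42)
m <- as.matrix(iris[, 1:4])
keep <- !duplicated(m)                 # Rtsne requires no duplicated rows
fit <- Rtsne(m[keep, ], pca = FALSE, perplexity = 10)
# color points by species to see whether the embedding separates them
plot(fit$Y, col = iris$Species[keep], pch = 20, asp = 1,
     xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")
```

Rerunning with different seeds or perplexities gives visibly different (though usually topologically similar) embeddings.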
Download MNIST data of the digits images from the Kaggle competition. The code is adapted from the one here.
The ./data/train.csv and ./data/test.csv files are data on the 28x28 pixel images of digits (0-9). The data is composed of:
- label: a column denoting the digit shown on the image
- pixel0 through pixel783: the pixel intensities (on the scale of 0-255), which together form the vectorized version of the 28x28 pixel digit image
# Skip the previous block and load the already subsetted training data (faster).
train <- read.csv("./data/train_small.csv")
dim(train)
[1] 1000 786
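To sanity-check the layout described above, a single row can be reshaped back into a 28x28 image. A sketch with simulated pixel values (the column layout, a label followed by pixel0..pixel783, is the assumption here):

```r
# Simulate one MNIST-style row: a label followed by 784 pixel intensities
rowVals <- c(3, sample(0:255, 784, replace = TRUE))
digitImg <- matrix(as.numeric(rowVals[-1]), nrow = 28, ncol = 28, byrow = TRUE)
# Reverse the rows so the image is not displayed upside down
image(t(digitImg[28:1, ]), col = grey.colors(256), axes = FALSE)
```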
# (... ?)
What do you observe? How does tSNE compare with PCA in this case?
Check out clustering methods available on CRAN.
Demonstration of how k-means clustering works can be found in this animation
Drawbacks:
The choice of the number of clusters, \(k\), can be made by considering statistics such as the total within-cluster sum of squares (the "elbow" method), the average silhouette width, or the gap statistic.
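For instance, the total within-cluster sum of squares typically shows an "elbow" at a reasonable k. A minimal sketch on the iris measurements (not part of the original workshop code):

```r
# Total within-cluster sum of squares for k = 1..8
set.seed(1)
wss <- sapply(1:8, function(k)
  kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve drops steeply up to the elbow and flattens afterwards; the elbow is a common choice for k.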
One application of k-means clustering is image segmentation. The code provided is adapted from the one found here.
Download your favorite image / do a quick Google search for one / use the one provided. Make sure that your picture is not too big (e.g. around 400 x 600 px is fine). Otherwise, the code might take a while to run.
The image provided is a picture of colorful tulips in the Netherlands downloaded from here.
library(jpeg)
url <- "http://www.infohostels.com/immagini/news/2179.jpg"
# Download the file and save it as "Image.jpg" in the directory
dFile <- download.file(url, "./figures/Image.jpg")
trying URL 'http://www.infohostels.com/immagini/news/2179.jpg'
Content type 'image/jpeg' length 174940 bytes (170 KB)
==================================================
downloaded 170 KB
img <- readJPEG("./figures/Image.jpg") # Read the image
(imgDm <- dim(img))
[1] 480 960 3
Represent the 3D array as a data frame with each row representing a single pixel. The columns x and y give the pixel location, and R, G, B denote the pixel intensity in red, green, blue respectively.
# Assign RGB channels to data frame
imgRGB <- data.frame(
x = rep(1:imgDm[2], each = imgDm[1]),
y = rep(imgDm[1]:1, imgDm[2]),
R = as.vector(img[,,1]),
G = as.vector(img[,,2]),
B = as.vector(img[,,3])
)
# ggplot theme to be used
plotTheme <- function() {
theme(
panel.background = element_rect(
size = 3,
colour = "black",
fill = "white"),
axis.ticks = element_line(
size = 2),
panel.grid.major = element_line(
colour = "gray80",
linetype = "dotted"),
panel.grid.minor = element_line(
colour = "gray90",
linetype = "dashed"),
axis.title.x = element_text(
size = rel(1.2),
face = "bold"),
axis.title.y = element_text(
size = rel(1.2),
face = "bold"),
plot.title = element_text(
size = 20,
face = "bold",
vjust = 1.5)
)
}
# Plot the image
ggplot(data = imgRGB, aes(x = x, y = y)) +
geom_point(colour = rgb(imgRGB[c("R", "G", "B")])) +
labs(title = "Original Image: Colorful Flowers") +
xlab("x") +
ylab("y") +
plotTheme()
Choose the number of clusters (colors in this case) and do k-means clustering with kmeans() built-in function.
kClusters <- 2
kMeans <- kmeans(imgRGB[, c("R", "G", "B")], centers = kClusters)
kColours <- rgb(kMeans$centers[kMeans$cluster,])
Then plot the results.
ggplot(data = imgRGB, aes(x = x, y = y)) +
geom_point(colour = kColours) +
labs(title = paste("k-Means Clustering of", kClusters, "Colours")) +
xlab("x") +
ylab("y") +
plotTheme()
kClusters <- 6
kMeans <- kmeans(imgRGB[, c("R", "G", "B")], centers = kClusters)
names(kMeans)
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
kColours <- rgb(kMeans$centers[kMeans$cluster,])
Then plot the results.
ggplot(data = imgRGB, aes(x = x, y = y)) +
geom_point(colour = kColours) +
labs(title = paste("k-Means Clustering of", kClusters, "Colours")) +
xlab("x") +
ylab("y") +
plotTheme()
Alexander Calder’s mobile
If you don’t want to specify the number of clusters, or if it is hard to choose it based on the diagnostics, you can try hierarchical clustering, which provides a graphical tree-based representation of the data, called a dendrogram, that allows you to evaluate where the cutoff for grouping should occur.
The hierarchical clustering algorithm is summarized below.
Source: ISL
The results for hierarchical clustering will differ based on your choice of the dissimilarity measure (e.g. Euclidean or correlation-based distance) and the linkage method (e.g. complete, average, single, or centroid).
Again, we will use Fisher’s Iris dataset, used before in the random forest section. We will show that hierarchical clustering of the observed attributes can group the flowers into clusters corresponding to their species.
?iris
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
We will use the built-in function hclust() to perform hierarchical clustering. Only the data on the petal dimensions will be used for computing the distances between flowers.
?hclust
# We use the Euclidean distance for the dissimilarities between flowers
distMat <- dist(iris[, 3:4])
# We use the "complete" linkage method for computing the cluster distances.
clusters <- hclust(distMat, method = "complete")
plot(clusters)
Inspecting the plot we see that a reasonable choice of the number of clusters is either 3 or 4.
plot(clusters)
abline(a = 2, b = 0, col = "blue")
abline(a = 3, b = 0, col = "blue")
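Alternatively, rect.hclust() draws boxes around the clusters produced by a given cut, which can make the choice easier to see:

```r
# Recompute the tree so this block is self-contained
clusters <- hclust(dist(iris[, 3:4]), method = "complete")
plot(clusters)
rect.hclust(clusters, k = 3, border = "red")
```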
Let’s pick 3 clusters. To get the cluster assignments of the leaves/observations (flowers in this case) when the tree is cut into 3 clusters, we can use the cutree() function.
(clusterCut <- cutree(clusters, 3))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 2 2 3 2 3 3 3 3 2 3 3 2 3 2 3 2 3 2
[74] 2 3 3 2 2 2 3 3 3 3 2 2 2 2 3 3 3 3 2 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[147] 2 2 2 2
table(clusterCut, iris$Species)
clusterCut setosa versicolor virginica
1 50 0 0
2 0 21 50
3 0 29 0
From the table we see that setosa and virginica were correctly assigned to separate groups. However, the method had trouble grouping the versicolor flowers into a separate cluster.
We can try another linkage method like “average” and see if we perform better.
?hclust
# We use the Euclidean distance for the dissimilarities between flowers
distMat <- dist(iris[, 3:4])
# We use the "complete" linkage method for computing the cluster distances.
clusters <- hclust(distMat, method = "average")
plot(clusters)
Here we can choose 3 or 5 clusters:
plot(clusters)
abline(a = 1.4, b = 0, col = "blue")
abline(a = 0.8, b = 0, col = "blue")
Again, we choose 3 clusters:
clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
clusterCut setosa versicolor virginica
1 50 0 0
2 0 45 1
3 0 5 49
We see that this time we do a little better in clustering the versicolors together.
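The linkage comparison can be automated. A sketch computing a simple "purity" score (the fraction of flowers falling in the majority species of their cluster) for several linkage methods accepted by hclust():

```r
# Compare linkage methods on the same petal-dimension distances
distMat <- dist(iris[, 3:4])
for (m in c("complete", "average", "single")) {
  ct <- cutree(hclust(distMat, method = m), k = 3)
  purity <- sum(apply(table(ct, iris$Species), 1, max)) / nrow(iris)
  cat(sprintf("%-8s purity = %.3f\n", m, purity))
}
```

Consistent with the tables above, average linkage scores higher than complete linkage on this data.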
We now plot the flowers using petal dimensions as coordinates, and show their cluster assignment as well as color points by species.
ggplot(iris, aes(Petal.Length, Petal.Width)) + theme_bw() +
geom_text(aes(label = clusterCut), vjust = -1) +
geom_point(aes(color = Species))